{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.3 Sampling distributions\n",
"\n",
"In order to use our estimate (the value of the estimator in our sample of data) to make any sort of statement about the true but unknown value of the population parameter, we need to consider questions such as: \n",
"> How precise do we believe our estimate is? \n",
"> Are we fairly certain that the true parameter is close to the estimate, or do we believe the estimate may well be far from the true value? \n",
"\n",
"The following thought experiment might help to develop these ideas. Suppose our population is a large bucket full of identical marbles. We want to know the population mean weight of a marble (our population parameter of interest). To estimate this population mean, we can simply sample a single marble from the bucket. So our estimator is the weight of the single sampled marble. Now suppose we took two samples: we sample a single marble, weigh it, put it back in the bucket, sample another marble and weigh that one. In this case, the estimate from the second sample (the weight of the sampled marble) would be exactly the same as the estimate from the first sample. No matter how many different samples we took, the sample estimate would be identical. Because all possible samples would give us an identical estimate of the mean, we can confidently say what the population mean is using a single sample of one marble.\n",
"\n",
"Now consider a bucket full of different marbles. In this case, randomly sampling a single marble and using the weight of that marble as an estimate of the population mean weight could give us a weight far too large (if we just happened to sample one of the very large marbles) or far too small (if we happened to pick a very small marble). However, if we were to pick 100 marbles and take the sample mean of those 100 marbles as our estimator, we would expect our estimate to be closer to the population mean. If we were to resample another 100 marbles, we would expect the sample mean weight to be fairly close to the mean weight of the previous 100 marbles. Conversely, if we took two samples containing one marble each, we might expect those two weights to be quite different from one another. \n",
"\n",
"This thought experiment makes it clear that in order to use our single sample of data to make statements about a wider population, we need to think about what would happen if we repeated our sampling: if we re-did our study many times, each time calculating the sample estimate, what values would those different sample estimates take? In fact, this is exactly what the **sampling distribution** is. It is the distribution of the **estimator** (the statistic we have chosen to use to estimate the population parameter of interest) under repeated sampling."
]
},
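{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the marble thought experiment, the sketch below simulates a hypothetical bucket of 10,000 marbles of different weights. The weights (uniform between 2 and 8 grams), the seed and the number of repeated studies are made-up values chosen only for this example; they are not part of the session's data.\n",
"\n",
"```r\n",
"# Hypothetical bucket: 10,000 marbles with made-up weights in grams (an assumption).\n",
"set.seed(2024)                      # arbitrary seed, used only for reproducibility\n",
"bucket <- runif(10000, min = 2, max = 8)\n",
"\n",
"n_studies <- 1000                   # number of times we repeat the study\n",
"\n",
"# Estimator A: the weight of a single sampled marble, one per repeated study.\n",
"single_marble <- replicate(n_studies, sample(bucket, size = 1))\n",
"\n",
"# Estimator B: the mean weight of 100 sampled marbles, one per repeated study.\n",
"mean_of_100 <- replicate(n_studies, mean(sample(bucket, size = 100)))\n",
"\n",
"# Spread of each estimator across the repeated studies.\n",
"sd(single_marble)\n",
"sd(mean_of_100)\n",
"```\n",
"\n",
"If the reasoning above holds, the second standard deviation should be much smaller than the first: the mean of 100 marbles varies far less from study to study than the weight of a single marble."
]
},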
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3.2 Simulated data: sampling distribution of a mean\n",
"\n",
"We now return to the emotional distress study. In reality, we do not know the true population mean and standard deviation. However, for the purposes of illustration, for the rest of the session we will imagine that we do know these values. Suppose that, in truth, the population mean age ($\\mu$) is 30 and the population standard deviation (which we will call $\\sigma$) is 4.8. Further, suppose that age follows a normal distribution in the population. \n",
"\n",
"Under this scenario, the following code draws 10,000 different samples from this population, with each sample containing the ages of 10 people. The line `set.seed(1042)` fixes the starting point of the pseudo-random number generator, so the same samples are drawn each time the code is run. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Ages of the 10 participants selected in study 1:\"\n"
]
},
{
"data": {
"text/html": [
"\n",
"